Combining Segmenter and Chunker for Chinese Word Segmentation

Authors

  • Masayuki Asahara
  • Chooi-Ling Goh
  • Xiaojie Wang
  • Yuji Matsumoto
Abstract

Our proposed method is to use a Hidden Markov Model-based word segmenter and a Support Vector Machine-based chunker for Chinese word segmentation. First, input sentences are analyzed by the Hidden Markov Model-based word segmenter. The word segmenter produces n-best word candidates together with some class information and confidence measures. Second, the extracted words are broken into character units, and each character is annotated with the possible word class and the position in the word, which are then used as the features for the chunker. Finally, the Support Vector Machine-based chunker brings character units together into words so as to determine the word boundaries.

1 Methods

We participate in the closed test for all four sets of data in the Chinese Word Segmentation Bakeoff. Our method is based on the following two steps:

1. The input sentence is segmented into a word sequence by the Hidden Markov Model-based word segmenter. The segmenter assigns a word class with a confidence measure to each word at the hidden states. The model is trained by the Baum-Welch algorithm.

2. Each character in the sentence is annotated with the word class tag and its position in the word. The n-best word candidates derived from the word segmenter are also extracted as features. A Support Vector Machine-based chunker corrects the errors made by the segmenter using the extracted features.

We will describe each of these steps in more detail. Our word segmenter is based on a Hidden Markov Model (HMM). We first decide the number of hidden states (classes) and assume that each word can belong to all the classes with some probability. The problem is defined as a search for the sequence of word classes $C = c_1, \ldots, c_n$ given a word sequence $W = w_1, \ldots, w_n$. The target is to find the $W$ and $C$ for a given input $S$ that maximize the following probability:

$$\arg\max_{W,C} P(W \mid C)\, P(C)$$

We assume that the word probability $P(W \mid C)$ is constrained only by its word class, and that the class probability $P(C)$ is constrained only by the class of the preceding word. These probabilities are estimated by the Baum-Welch algorithm using the training material (see Manning and Schütze, 1999). The learning process is based on the Baum-Welch algorithm and is the same as the well-known use of HMMs for the part-of-speech tagging problem, except that …
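
As a concrete illustration of step 2, the sketch below shows how a 1-best word sequence from the segmenter can be recoded into per-character features of the kind the chunker consumes. This is a hypothetical Python sketch, not code from the paper: the B/I/E/S position-tag inventory and the class labels "c3" and "c7" are assumptions for illustration only.

from typing import List, Tuple

def word_to_char_features(words: List[str], classes: List[str]) -> List[Tuple[str, str, str]]:
    """Break segmented words into characters, tagging each character
    with its position in the word and the word's class."""
    features = []
    for word, cls in zip(words, classes):
        if len(word) == 1:
            features.append((word, "S", cls))       # single-character word
        else:
            features.append((word[0], "B", cls))    # word-initial character
            for ch in word[1:-1]:
                features.append((ch, "I", cls))     # word-internal character
            features.append((word[-1], "E", cls))   # word-final character
    return features

# Example with a 1-best hypothesis; "c3" and "c7" stand in for the
# HMM's learned, unlabeled hidden classes.
print(word_to_char_features(["北京", "大学"], ["c3", "c7"]))
# [('北', 'B', 'c3'), ('京', 'E', 'c3'), ('大', 'B', 'c7'), ('学', 'E', 'c7')]

The chunker then predicts corrected position tags over this character sequence, and the final word boundaries are read off from those tags.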

Similar Articles

A Unified Framework for Text Analysis in Chinese TTS

This paper presents a robust text analysis system for Chinese text-to-speech synthesis. In this study, a lexicon word or a continuum of non-hanzi characters with the same category (e.g. a digit string) is defined as a morpheme, which is the basic unit forming a Chinese word. Based on this definition, the three key issues concerning the interpretation of real Chinese text, namely lexical disambiguation ...

A Maximum Entropy Approach to Chinese Word Segmentation

We participated in the Second International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter in the open track, on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking University (PKU). Based on a maximum entropy approach, our word segmenter achieved the highest F measure for AS, CITYU, ...

Towards a Hybrid Model for Chinese Word Segmentation

This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...
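
The merging component's behavior can be sketched in a few lines. This is a minimal, hypothetical illustration assuming a B/I/E/S-style position-tag set; the paper's actual tag inventory may differ.

def merge_tagged_chars(chars, tags):
    """Collapse a position-tagged character sequence back into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            words.append(ch)
        elif tag == "B":          # start a new word
            current = ch
        elif tag == "I":          # continue the current word
            current += ch
        else:                     # "E": close the current word
            words.append(current + ch)
            current = ""
    return words

print(merge_tagged_chars(list("北京大学"), ["B", "E", "B", "E"]))
# ['北京', '大学']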

Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation

Unknown words and word segmentation granularity are two main problems in Chinese word segmentation for Chinese-Japanese Machine Translation (MT). In this paper, we propose an approach of exploiting common Chinese characters shared between Chinese and Japanese in Chinese word segmentation optimization for MT, aiming to solve these problems. We augment the system dictionary of a Chinese segmenter b...

High OOV-Recall Chinese Word Segmenter

For the Chinese word segmentation competition held at the first CIPS-SIGHAN Joint Conference, we applied a subword-based word segmenter using CRFs and extended the segmenter with OOV words recognized by Accessor Variety. Moreover, we proposed several post-processing rules to improve the performance. Our system achieved promising OOV recall among all the participants.
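
Accessor Variety, the statistic mentioned here for OOV recognition, scores a candidate string by how freely it combines with different contexts. A minimal sketch under that definition follows (hypothetical code; the toy corpus is made up): it counts the distinct characters adjacent to each corpus occurrence of the candidate and takes the smaller side.

def accessor_variety(candidate: str, corpus: str) -> int:
    """Return the smaller of the counts of distinct characters seen
    immediately left and right of the candidate in the corpus."""
    left, right = set(), set()
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:
            left.add(corpus[start - 1])          # distinct left neighbors
        end = start + len(candidate)
        if end < len(corpus):
            right.add(corpus[end])               # distinct right neighbors
        start = corpus.find(candidate, start + 1)
    return min(len(left), len(right))

corpus = "我去北京了他在北京住我们到北京去"
print(accessor_variety("北京", corpus))  # 3

A candidate that appears after and before many different characters behaves like a free-standing word, so a high score flags plausible OOV words.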

Publication date: 2003